2  Goals


The research has focused on a technology for “mobile information management”. Underlying this
technology is a mathematical foundation enabling the use of formal methods in developing and
reasoning about the construction of mobile information management components and their use in
database integration and transformation. The salient features of our approach are:

1.  Use  of  unmaterialized  views. 

2.  Dynamic  integration  of  data  consumers  and  data  sources,  using  mobile  query  processes. 

3.  Data  interface  specifications,  based  on  XML  schemas. 

4.  Specifications  for  transformations,  based  on  XML  query  languages. 

5.  The  development  of  formal  methods,  focusing  on  query  and  constraint  reformulation. 

3  Project  Summaries 

3.1  Mobile  queries  and  distributed  query  languages 

Sahuguet and Tannen have worked on ubQL, a new distributed query language for programming
large-scale distributed query systems such as resource sharing systems. The language is obtained
by adding a small set of mobile process primitives (communication channels, migration operators,
etc.) on top of any traditional query language. Queries are encapsulated into processes and can
migrate between sites, thus enabling cooperation. An important methodological device is the
separation of the installation (including migration) of query processes from the distributed execution
of the queries. ubQL allows the encoding of widely used distributed query patterns such as
chaining, referral, subscription, leasing, recruiting, query/data/hybrid shipping, etc. The work also
evaluates some language-based rewrite strategies for the installation of ubQL queries that use only
partial and distributed knowledge of execution costs.

Sahuguet and Tannen (with Pierce) have worked on new mechanisms in distributed query
optimization. This work outlines a flexible framework for optimizing and deploying distributed queries
in wide area networks. The database field has developed very powerful techniques for finding
efficient execution plans for declaratively specified queries. However, applying these optimization
techniques  in  the  setting  of  distributed  information  management  requires  centralized  knowledge  of 
the  entire  network  and  assumes  passive  behavior  from  the  data  sources.  The  reality  of  the  Web 
is  different.  Future  distributed  query  optimizers  must  handle  (in  fact,  exploit!)  a  rich  variety  of 
information  flow  mechanisms  like  chaining,  referral,  proxying,  brokering,  publish-subscribe,  leasing, 
etc.  We  look  to  mobile  agent  technologies  for  the  combination  of  flexibility  and  precision  needed 
for  handling  these  mechanisms.  Our  language-based  approach  uses  a  mobile  process  calculus  based 
on  the  pi-calculus  in  combination  with  a  powerful  query-plan  language.  The  salient  characteristic 
of  the  language  is  that  messaging,  migration,  and  database  operations  all  live  in  the  same  semantic 
space  and  interact,  creating  new  opportunities  for  optimization. 


3.2  Query  reformulation  and  optimization 

Popa  and  Tannen  have  studied  a  class  of  path-conjunctive  queries  and  constraints  (dependencies) 
defined  over  complex  values  with  dictionaries.  This  class  includes  the  relational  conjunctive  queries 
and embedded dependencies, as well as many interesting examples of complex value and OODB
queries and integrity constraints. We show that some important classical results on containment,
dependency implication, and chasing extend and generalize to this class.

Deutsch, Popa and Tannen have continued the work on an optimization method and algorithm
designed for several objectives: physical data independence, using materialized views/cached queries,
semantic optimization, and generalized tableau minimization. The method relies on generalized
forms of chase and “backchase” with constraints (dependencies). By using dictionaries (finite
functions) in physical schemas we can capture with constraints useful access structures such as indexes,
materialized views, source capabilities, access support relations, gmaps, etc. In this reporting
period, we have shown that the method is usable in realistic optimizers by extending it to bag and
mixed (i.e., bag-set) semantics as well as to grouping views, and by showing how to integrate it
with standard cost-based optimization. We understand materialized views broadly, including
user-defined views, cached queries and physical access structures (such as join indexes, access support
relations, and gmaps). Moreover, our internal query representation supports object features, hence
the method applies to OQL and (extended) SQL:1999 queries. Chase and backchase support a
very general class of integrity constraints and can thus find execution plans using views that
do not fall in the scope of other methods. In fact, we prove completeness theorems that show that
our method will find the best plan in the presence of common and practically important classes of
constraints and views, even when bag and set semantics are mixed.

The  search  space  for  query  plans  is  defined  and  enumerated  in  a  novel  manner:  the  chase  phase 
rewrites the original query into a “universal” plan that integrates all the access structures and
alternative pathways that are allowed by applicable constraints. Then, the backchase phase produces
optimal  plans  by  eliminating  various  combinations  of  redundancies,  again  according  to  constraints. 
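
The two phases can be illustrated on a toy case. The following is a minimal sketch only, assuming plain relational conjunctive queries under set semantics, no cost model, and exhaustive enumeration of subqueries; it is not the implementation described in this report. The relations R and S, the view V, and all function names are hypothetical. A materialized view V(a, b) defined as the join of R(a, c) and S(c, b) is captured by two constraints, one in each direction.

```python
from itertools import combinations, count

_fresh = count()  # shared counter so invented variable names never repeat


def homomorphisms(src, dst, mapping):
    """Yield every extension of `mapping` sending each atom of src to an atom of dst."""
    if not src:
        yield mapping
        return
    (rel, args), rest = src[0], src[1:]
    for drel, dargs in dst:
        if drel != rel or len(dargs) != len(args):
            continue
        new = dict(mapping)
        if all(new.setdefault(a, d) == d for a, d in zip(args, dargs)):
            yield from homomorphisms(rest, dst, new)


def chase(body, constraints):
    """Add the conclusion of every violated constraint until none is violated."""
    body = list(body)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in constraints:
            for h in list(homomorphisms(premise, body, {})):
                if next(homomorphisms(conclusion, body, h), None) is None:
                    ext = dict(h)
                    for _, args in conclusion:
                        for a in args:
                            if a not in ext:  # existential variable: invent a name
                                ext[a] = f"_n{next(_fresh)}"
                    body += [(rel, tuple(ext[a] for a in args)) for rel, args in conclusion]
                    changed = True
    return body


def backchase(head, universal, constraints):
    """Return the minimal subqueries of the universal plan equivalent to it."""
    plans = []
    for k in range(1, len(universal) + 1):
        for sub in combinations(universal, k):
            if any(set(p) <= set(sub) for p in plans):
                continue  # a smaller plan is already contained in this one
            if not set(head) <= {a for _, args in sub for a in args}:
                continue  # some head variable would be unbound
            fixed = {v: v for v in head}
            if next(homomorphisms(universal, chase(sub, constraints), fixed), None) is not None:
                plans.append(list(sub))
    return plans


# View V(a, b) = R(a, c) join S(c, b), expressed as a pair of constraints.
constraints = [
    ([("R", ("a", "c")), ("S", ("c", "b"))], [("V", ("a", "b"))]),
    ([("V", ("a", "b"))], [("R", ("a", "c")), ("S", ("c", "b"))]),
]
head = ("x", "y")
query = [("R", ("x", "z")), ("S", ("z", "y"))]

universal = chase(query, constraints)
plans = backchase(head, universal, constraints)
print(universal)
print(plans)
```

In this toy case the universal plan mentions R, S and V together, and the backchase recovers both the original join and the plan that scans only V. A realistic optimizer would prune the subquery enumeration and rank the surviving plans by cost.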

This  method  is  applicable  (sound)  to  a  large  class  of  queries,  physical  access  structures,  and  semantic 
constraints. We prove that it is in fact complete for path-conjunctive queries and views with complex
objects, classes and dictionaries, going beyond previous theoretical work on processing queries using
materialized  views. 

Popa, Deutsch, Sahuguet and Tannen have studied the practicality of this novel method for generating
alternative query plans that uses chasing (and back-chasing) with logical constraints. The method
brings together use of indexes, use of materialized views, semantic optimization and join elimination
(minimization). Each of these techniques is known separately to be beneficial to query optimization.
The novelty of our approach is in allowing these techniques to interact systematically; e.g., non-trivial
use of indexes and materialized views may be enabled only by semantic constraints.

We  have  implemented  our  method  for  a  variety  of  schemas  and  queries.  We  examine  how  far  we  can 
push the method in terms of complexity of both schemas and queries. We propose a technique for
reducing the size of the search space by “stratifying” the sets of constraints used in the (back)chase.
The experimental results demonstrate that our method is practical (i.e., feasible and worthwhile).

Hara and Davidson have studied functional dependencies for nested data. Functional dependencies
add  semantics  to  a  database  schema,  and  are  useful  for  studying  various  problems,  such  as  database 
design,  query  optimization  and  how  dependencies  are  carried  into  a  view.  In  the  context  of  a  nested 
relational  model,  these  dependencies  can  be  extended  by  using  path  expressions  instead  of  attribute 
names,  resulting  in  a  class  of  dependencies  that  we  call  nested  functional  dependencies  (NFDs). 
NFDs  define  a  natural  class  of  dependencies  in  complex  data  structures;  in  particular  they  allow 
the  specification  of  many  useful  intra-  and  inter-set  dependencies  (i.e.,  dependencies  that  are  local 
to  a  set  and  dependencies  that  require  consistency  between  sets). 
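
The difference between intra-set and inter-set dependencies can be made concrete with a small check. This is a minimal sketch, assuming a nested relation of courses, each holding a set of student records; the data, the attribute names, and the helper `fd_holds` are hypothetical and do not reproduce the NFD notation or inference rules of the work summarized above.

```python
def fd_holds(rows, lhs, rhs):
    """True if equal values of `lhs` always come with equal values of `rhs` in `rows`."""
    seen = {}
    return all(seen.setdefault(row[lhs], row[rhs]) == row[rhs] for row in rows)


courses = [
    {"cnum": "c1", "students": [
        {"sid": 1, "name": "Ann", "grade": "A"},
        {"sid": 2, "name": "Bob", "grade": "B"},
    ]},
    {"cnum": "c2", "students": [
        {"sid": 1, "name": "Ann", "grade": "B"},
    ]},
]

# Intra-set: within each course, a student id determines the grade.
intra_grade = all(fd_holds(c["students"], "sid", "grade") for c in courses)

# Inter-set: across all courses, a student id determines the name,
# but it does not determine the grade.
all_students = [s for c in courses for s in c["students"]]
inter_name = fd_holds(all_students, "sid", "name")
inter_grade = fd_holds(all_students, "sid", "grade")

print(intra_grade, inter_name, inter_grade)
```

The same attribute pair (sid, grade) satisfies the dependency locally but violates it globally, which is why the path along which a dependency is evaluated has to be part of its specification.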

3.3  XML  and  semistructured  data 

XML has become an increasingly popular data format embraced by many different communities.
XML is extremely attractive because it offers a simple, intuitive and uniform text-based syntax
and is extensible. One can find today XML proposals for messages, text content delivery and
presentation, data content, documents, software components, scientific data, real-estate ads, financial
products, cooking recipes, etc. Unfortunately this also means that XML is far too general, and
if people plan to use it in serious applications (mainly for Electronic Document Interchange, in a
broad sense), they will need to provide a specification (i.e., structure, constraints, etc.) for their
XML, which XML itself cannot offer. In order to specify and enforce this structure, people have
been using Document Type Definitions (DTDs), inherited from SGML, and, more recently, XML
Schema.

Buneman, Davidson, Fan, Hara, and Tan have investigated integrity constraints for XML data.
Both DTDs and the XML Schema proposal lack a clean and general treatment of key dependencies.
We discuss the definition of keys for XML documents, paying particular attention to the concept
of a relative key, which is commonly used in hierarchically structured documents and scientific
databases. We also investigate the (finite) implication problems associated with these dependencies.
In contrast to other proposals of keys for XML, the two classes of keys we study can be reasoned about
efficiently. In particular, we show that their (finite) implication problems are finitely axiomatizable
and are decidable in polynomial time.
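
A relative key is easiest to see on a small document. The sketch below is an illustration only, assuming a simplified notion of key (uniqueness of one attribute among the elements reached by a path); the document, the element names, and the helper `is_key` are hypothetical and much weaker than the key definitions studied in the work. It uses the standard-library `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

DOC = """<library>
  <book isbn="1"><chapter number="1"/><chapter number="2"/></book>
  <book isbn="2"><chapter number="1"/></book>
</library>"""


def is_key(context, target_path, attr):
    """True if `attr` is unique among the elements reached from `context` by `target_path`."""
    values = [e.get(attr) for e in context.findall(target_path)]
    return len(values) == len(set(values))


root = ET.fromstring(DOC)

# As an absolute key over the whole document, the chapter number fails:
# two different books both have a chapter numbered 1.
absolute = is_key(root, "./book/chapter", "number")

# Relative to each book, the chapter number does identify a chapter.
relative = all(is_key(book, "./chapter", "number") for book in root.findall("./book"))

print(absolute, relative)
```

The chapter number only identifies a chapter once the enclosing book is fixed, which is the situation in hierarchically structured documents that the notion of a relative key is meant to capture.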

Buneman,  Deutsch  and  Tan  have  worked  on  a  deterministic  model  for  semistructured  data  and 
Buneman  and  Pierce  have  worked  on  union  types  for  semistructured  data.  Semistructured  databases 
are  treated  as  dynamically  typed:  they  come  equipped  with  no  independent  schema  or  type  system 
to constrain the data. Query languages that are designed for semistructured data, even when used
with structured data, typically ignore any type information that may be present. The consequences
of this are what one would expect from using a dynamic type system with complex data: fewer
guarantees on the correctness of applications. For example, a query that would cause a type error
in a statically typed query language will silently return the empty set when applied to a
semistructured representation of the same data. We describe a system of untagged union types that can
accommodate variations in structure while still allowing a degree of static type checking.
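
The silent-failure problem, and the kind of check a union type makes possible, can be sketched as follows. This is an illustration under stated assumptions, not the type system of the work: semistructured data is modeled as nested Python dictionaries and lists, a type is modeled as a dictionary (record), a one-element list (set), a tuple (untagged union of alternatives), or a string (base type), and the functions `select` and `admits` are hypothetical.

```python
def select(value, path):
    """Dynamically typed path query: follow labels, silently skipping mismatches."""
    current = [value]
    for label in path:
        nxt = []
        for v in current:
            items = v if isinstance(v, list) else [v]
            nxt += [i[label] for i in items if isinstance(i, dict) and label in i]
        current = nxt
    return current


def admits(t, path):
    """True if some value of type `t` could contain this label path."""
    if not path:
        return True
    if isinstance(t, tuple):   # untagged union: any alternative may match
        return any(admits(alt, path) for alt in t)
    if isinstance(t, list):    # set type: look inside the element type
        return admits(t[0], path)
    if isinstance(t, dict):    # record type: the label must be declared
        return path[0] in t and admits(t[path[0]], path[1:])
    return False               # base types have no labels


db = {"people": [{"name": "Ann", "phone": "555"}, {"name": "Bob", "email": "b@x"}]}
db_type = {"people": [({"name": "str", "phone": "str"}, {"name": "str", "email": "str"})]}

misspelled = select(db, ["people", "phon"])       # no error, just nothing found
rejected = admits(db_type, ["people", "phon"])    # the type check catches the typo
accepted = admits(db_type, ["people", "email"])   # valid in one union alternative
emails = select(db, ["people", "email"])

print(misspelled, rejected, accepted, emails)
```

The misspelled label yields an empty answer at run time, whereas checking the path against the union type rejects it before evaluation while still accepting a label that occurs in only one alternative.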

Sahuguet has obtained some preliminary results that explore how DTDs are being used for
specifying the structure of XML documents. By examining some publicly available DTDs, we look at how
people  are  actually  (mis)using  DTDs,  show  some  shortcomings,  list  some  requirements  and  discuss 
possible  replacements. 

Liefke has worked on horizontal query optimization on ordered semistructured data. The
exchange and storage of XML data is becoming increasingly important. In contrast to conventional
semistructured data, the labels in a document-oriented representation such as XML are ordered.
Furthermore, regular expressions (DTDs) describe the horizontal (and vertical) structure.
Conventional query languages for semistructured data ignore the horizontal order and are therefore
limited in their expressiveness and optimizability. We describe a query language for ordered
semistructured data that provides primitives for specifying more powerful
queries. Furthermore, we describe how horizontal type information
in DTDs is used to optimize queries based on finite automata.

Liefke and Davidson have investigated view maintenance for hierarchical semistructured data.
While several important aspects of XML have been investigated, such as query languages, type
systems, and storage models, the issue of incrementally maintaining XML views is largely
unstudied. XML views differ from relational views in two essential ways: 1) there is no rigid type
system, and 2) the query definition often performs complex restructuring far beyond the typical
select-project-join query definition in relational views. We address the problem of incrementally
maintaining views over XML data with key constraints. We describe a system called WHAX
(Warehouse Architecture for XML) that allows the definition and incremental maintenance of views over
existing relational and XML data sources with keys. Our query language supports important
operations, such as joins, aggregations, regrouping, and restructuring operations such as flattening.
We generalize several well-known results about view maintenance in the relational model based on
the notion of “multi-linearity”. Furthermore, we demonstrate how incremental view maintenance
improves the efficiency for XML views defined on real XML data.
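
The relational starting point for multi-linearity can be checked directly. The sketch below is a minimal illustration, assuming flat relations, set semantics and insertions only; it shows the well-known relational case that the work generalizes, not the WHAX algorithm itself. The relations and the function `join` are hypothetical.

```python
def join(r, s):
    """Natural join of R(a, b) and S(b, c) under set semantics."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


r = {(1, "x"), (2, "y")}
s = {("x", 10), ("y", 20)}
view = join(r, s)  # the materialized view

# Insertions into both sources.
delta_r = {(3, "x")}
delta_s = {("y", 30)}

# Because join distributes over union in each argument (it is multi-linear),
# the change to the view is computed from the deltas alone.
delta_view = join(delta_r, s) | join(r, delta_s) | join(delta_r, delta_s)

incremental = view | delta_view
recomputed = join(r | delta_r, s | delta_s)
print(incremental == recomputed)
```

Only the delta terms touch new data, so the cost of a refresh depends on the size of the update rather than on the size of the view. Deletions and bag semantics need additional care that this sketch does not address.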

3.4  Data  Provenance  and  Annotation 

Buneman and Tan (with Khanna) have investigated definitions and properties of the data
provenance concept. With the proliferation of database views and curated databases, the issue of data
provenance (where a piece of data came from and the process by which it arrived in the database)
is becoming increasingly important, especially in scientific databases where understanding
provenance is crucial to the accuracy and currency of data. We describe an approach to computing
provenance when the data of interest has been created by a database query. We adopt a syntactic
approach and present results for a general data model that applies to relational databases as well
as to hierarchical data such as XML. A novel aspect of our work is a distinction between “why”
provenance (which refers to the source data that had some influence on the existence of the data) and
“where” provenance (which refers to the location(s) in the source databases from which the data was
extracted).
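
The distinction can be illustrated on a two-table join. This is a simplified sketch, assuming small in-memory tables keyed by tuple identifier and a single hard-coded query; the tables, identifiers, and the function `names_in_building` are hypothetical, and the formal definitions in the work are more general than what is recorded here.

```python
emp = {
    "e1": {"name": "Ann", "dept": "d1"},
    "e2": {"name": "Bob", "dept": "d2"},
}
dept = {
    "d1": {"bldg": "Moore"},
    "d2": {"bldg": "Levine"},
}


def names_in_building(emp, dept, bldg):
    """Names of employees whose department is in `bldg`, annotated with provenance."""
    out = []
    for eid, e in emp.items():
        for did, d in dept.items():
            if e["dept"] == did and d["bldg"] == bldg:
                out.append({
                    "value": e["name"],
                    # why: the source tuples whose presence made this answer appear
                    "why": {("emp", eid), ("dept", did)},
                    # where: the source location the output value was copied from
                    "where": ("emp", eid, "name"),
                })
    return out


results = names_in_building(emp, dept, "Moore")
print(results)
```

The answer "Ann" exists because of both the employee tuple and the department tuple, yet its value was copied from a single field of the employee tuple, so the two kinds of provenance differ even for this simple query.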




Buneman  and  Tan  (with  Bird)  have  investigated  the  design  of  a  query  language  for  annotation 
graphs. The multidimensional, heterogeneous, and temporal nature of speech databases raises
interesting challenges for representation and query. Recently, annotation graphs have been proposed
as a general-purpose representational framework for speech databases. Typical queries on
annotation graphs require path expressions similar to those used in semistructured query languages.
However, the underlying model is rather different from the customary graph models for
semistructured data: the graph is acyclic and unrooted, and both temporal and inclusion relationships are
important.  We  develop  a  query  language  and  describe  optimization  techniques  for  an  underlying 
relational  representation. 

3.5  Updates

Davidson and Liefke have worked on the problem of maintaining derived data in the context of
database changes. “View maintenance” describes the problem of maintaining a materialized view
while updating the source database(s). Updates to the source database are either immediately
propagated to the view or are accumulated over time, with the view updated at regular intervals
(for instance, overnight). “View update” is the problem of propagating updates to the view to
the source database.

They have developed a generic update language, CPL+, for updating complex value databases,
that is, databases containing values composed of base values, sets, tuples, and variants. The complex
value model is a generalization of the relational model. We propose various simplifications and
optimizations so that an update on a given database is transformed into a more efficient update
expression. Further, they extended this work to the object-oriented data model. A new language,
OQL+, has been developed to specify updates for such databases in the flavor of OQL and the
update primitives known from SQL. Interesting issues such as efficient execution, non-determinism
of updates, and cost-based optimizations are investigated in this project.


3.6  Integrated  Access  to  Genomic  Data  Sources 

Davidson, Tannen, et al. have performed and reported on experiments in applications of databases
to bioinformatics. The integration of heterogeneous data sources and software systems is a major
issue in the biomedical community and several approaches have been explored: linking databases,
“on-the-fly” integration through views, and integration through warehousing. We report on our
experiences with two systems that were developed at the University of Pennsylvania: an integration
system called K2, which has primarily been used to provide views over multiple external data sources
and software systems; and a data warehouse called GUS, which downloads, cleans, integrates and
annotates data from multiple external data sources. Although the view and warehouse approaches
each have their advantages, there is no clear “winner”. Therefore, users must consider how the
data is to be used, what the performance guarantees must be, and how much programmer time
and expertise are available to choose the best strategy for a particular application. Our experiences
also point to some practical tips on how updates should be published by the community, and how
XML can be used to facilitate the processing of updates in a warehousing environment.

Davidson  and  Liefke  (with  Limsoon  Wong)  have  investigated  creating  and  maintaining  curated 
view  databases.  The  process  of  building  a  new  database  relevant  to  some  field  of  study  in  biology 
involves  transforming,  integrating,  and  cleansing  multiple  external  data  sources,  as  well  as  adding 
new material and annotations. Creating and maintaining these “view” databases raises a number
of problems: 1) How can we specify and implement the transformation and integration from the
underlying source databases to the view database? 2) How can we automate the refresh process? 3)
How can we track the origins or “provenance” of data? The work discusses these phases of creating
and maintaining curated view databases and contrasts solutions where appropriate.